Finding motifs using DNA images derived from sparse representations

Abstract Motivation Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks. Results We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs that we show to commonly exist in next-generation sequencing datasets, in addition to the short and enriched primary binding sites. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept emerged from our approach—enumerating at the image level—effectively overcomes the k-mers paradigm, enabling modest computational resources for capturing the long and varied but conserved patterns, in addition to capturing the primary binding sites. Availability and implementation Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033.


Introduction
Identifying conserved substrings in a set of unaligned DNA strings is a fundamental challenge in computational biology. These conserved substrings, known as motifs, emerge due to evolutionary forces, and are known to play a crucial role in regulatory mechanisms. Elucidating the regulatory motifs present in specific genomic regions, such as the promoters and enhancers, can shed light on gene regulation mechanisms and contribute to our understanding of biological processes. As such, accurately identifying motifs is an essential step toward understanding the complex interplay between genes and their regulation.
Motifs are often inferred from the representations of computational models applied to DNA strings. Traditionally, motifs are inferred directly, as the conventional methods typically depict the motifs as consensus strings (Li et al. 1999), product multinomials (Bailey andElkan 1995, Liu et al. 1995), or position weight matrices (Hertz and Stormo 1999, Heinz et al. 2010, Bailey 2021. While the traditional methods, e.g. STREME and HOMER, are efficient in identifying the primary motif, they often employ heuristics such as substring masking that turns the methodology into a sequential procedure. This methodological approach can make it challenging to discover secondary motifs in the dataset in a principled manner, as secondary motifs usually share identical patterns with the primary motifs. Moreover, these conventional methods frequently rely on k-mer enumeration to generate initial seeds for the optimization subroutine, constraining them to identify only ungapped motifs and limiting the maximum length of motifs that can be discovered. For example, STREME version 5.5.1 by default can only detect motifs that are up to 30 base pairs in length.
More recently, motifs are inferred from deep learning methodologies, epitomized by the use of convolutional neural network (CNN). Compared with traditional methods, inferring the motifs is less straightforward: some approaches identify the motifs as the filters in the first convolutional layers, while others use model agnostic explanation methods such as DeepLift or SHapley Additive exPlanations (SHAP) (Lundberg andLee 2017, Shrikumar et al. 2017). Furthermore, the filters in CNNs are convolved with DNA strings to create embeddings ( Fig. 1), which by construction are primarily intended for building regressors, e.g. predicting the bigWig coverage (Kelley et al. 2016, Avsec et al. 2021a,b, Yuan and Kelley 2022. Therefore, identifying all conserved patterns in datasets is typically not the main objective of CNNs. Consequently, the motifs predicted by CNNs are not significantly different from those predicted by traditional methods, and a principled, unifying model that effectively captures the long, gapped, and flexible motifs present in the dataset is still lacking.
In this work, we introduce a hierarchical sparse representation as a principled framework for motif discovery. Our method is capable of capturing statistically significant long, gapped, and flexible motifs in addition to the primary motifs in the dataset. These motifs are often longer than 30 base pairs in length, beyond the computable range for methods based on k-mer enumerations such as STREME and HOMER. In addition, as sparse representations are interpretable by design, we alleviate the need for explanation methods like SHAP. Our method can efficiently scale to large datasets containing hundreds of thousands of DNA strings, requiring only a moderate amount of computational resources.

Notation
We refer to DNA strings as strings defined on R ¼ fA; C; G; Tg. The i-th element of a vector v is denoted by v½i. Vectors and matrices are denoted by boldface letters, while scalars are non-bold. We use the notation x ! 0 to indicate that all components of the vector or matrix x are nonnegative. The function j Á j returns the cardinality of a finite set. The norms jj Á jj F ; jj Á jj 2 ; jj Á jj 1 , and jj Á jj 0 denote the Frobenius norm, ' 2 norm, ' 1 norm, and ' 0 norm, respectively.

Problem formulation
Our method begins with a simple idea: we induce a sparse representation of a DNA string by assuming that each onehot encoded DNA string s n can be represented as a finite sum of linear convolutions, i.e.
where d m represents the features, often called filters, and x mn is the corresponding sparse vector of s n , known as the sparse code (Bristow et al. 2013, Heide et al. 2015, Wohlberg 2016, Dumitrescu and Irofti 2018, Garcia-Cardona and Wohlberg 2018. We consider the filters can be reshaped as '-column position frequency matrices (PFMs) that capture the frequency of nucleotides at each position of a length-' DNA string, i.e.
which resolves the scaling ambiguity between the filters and the sparse code. A key insight we present is that the sparse code can be likened to images. We refer to such images as code images, and we generate one for each string s n by horizontally concatenating the sparse code x mn for all m ( Fig. 2A).
Using the set of DNA strings and their corresponding code images, we proceed to construct a sparse representation specifically tailored to represent these images (Fig. 2B). By doing so, we are able to identify a more extensive and diverse set of patterns in the DNA strings by enumerating the combinations within the sparse code. This technique, which we refer to as enumerating at the image level, yields a broader range of significant patterns than by enumerating the k-mers in the DNA strings.
To start the motif discovery of DNA strings s 1 ; . . . ; s N , we first tackle the following problem as a precursor to enumerating at the image level: In problem 2, our first goal is to approximate DNA strings by a sum of convolutions while promoting the sparsity in the code vectors x mn ; y mn . The filter d m is reversed to obtaind m , enabling the consideration of patterns in both the forward and complementary strands of DNA strings. Our second goal in problem 2 is to extract patterns at the image level (Fig. 2B). To be precise, the matrices X Án ; Y Án , and Z Án are constructed by horizontally concatenating the corresponding code vectors x 1n ; . . . ; x Mn ; y 1n ; . . . ; y Mn , and z 1n ; . . . ; z Kn , respectively. Similarly, the matrix ðX Án ; ; Y Án Þ is formed by concatenating X Án and Y Án horizontally. The transformed version of ðX Án ; Y Án Þ through a linear transform T is denoted as T ðX Án ; Y Án Þ. To ensure a parsimonious representation of the transformed image T ðX Án ; Y Án Þ, we constrain the matrix Z Án to be at most a-sparse and require the sum of convolutions In CNN, the filters are convolved with the one-hot encoded DNA string to generate an embedding for downstream purposes, e.g. predicting the bigWig coverage. In sparse representations, the filters are convolved with the sparse code, where the sparse code plays the role of an indicator, indicating where the filters should represent the DNA string.

2
Chu and Stormo P k F k Ã z kn to match T ðX Án ; Y Án Þ. Last, we impose a sparsity penalty on the filters F k so that they capture the most salient features in the images T ðX Án ; Y Án Þ.

Obtaining the sparse representation: the classical way
We can design iterative algorithms to solve problem 2 by alternatively minimizing the sparse code fx mn ; y mn ; z kn g and the filters fF k ; d m g. To do so, we define L Án and R Án as the left and right halves of the image P k F k Ã z kn , respectively, and for simplicity, let T be the identity transform. With the filters fF k ; d m g held fixed, we apply Alternating Direction Method of Multipliers (ADMM) (Boyd 2010) to problem 2 to solve the sparse code fx mn ; y mn ; z kn g: where L mn ; R mn are the m-th column of L Án and R Án , respectively. The scaled dual variables are C Án and ! Án , and the penalty parameter is q. The function 1 a ðÁÞ is an indicator function that projects the input matrix into the space of matrices with at most a non-zero elements. The solution to equation (3) can be obtained by alternatively solving x mn and y mn using iterative shrinkage thresholding algorithm (ISTA), i.e. for all m, n, where~is the cross correlation, S þ kg t ðÁÞ is the non-negative soft-threshold operator, and g t the step size at t. We consider projected gradient descent to solve equation 4. Here, an update step for all n is where Project a ðÁÞ keeps at most a largest magnitude components and zeros out the rest, G Z the gradient of the penalty term of equation (4), and c t is the step size at t. On the other hand, by applying block coordinate descent with method of multipliers to problem 2 with the sparse code fx mn ; y mn ; z kn g held fixed, we obtain the following iterative algorithm to solve the filters fF k ; d m g: Figure 2. We show how we enumerate at the image level, with a single one-hot encoded DNA string as the input: (A) we first obtain a sparse representation of the one-hot encoded DNA string as a sum of linear convolutions. Next, we concatenate the collection of the sparse code in this sparse representation as an image. We refer to this image as a code image. (B) We obtain a sparse representation of the code image. Using the sparse representation of the code image, we enumerate the configurations within the sparse code.
Finding motifs using DNA images derived from sparse representations 3 where s is the penalty term and H Án are the scaled dual variables. We can solve equation (8) via mirror descent (Beck and Teboulle 2003). Because we can reshape each d m into a '-column PFM by assumption, a way to express each component of d m is d m ½4ðj 1 À 1Þ þ j 2 for j 1 ¼ 1; . . . ; '; j 2 ¼ 1; . . . ; 4. We define the distance function w associated with the mirror descent as the sum of the negative entropy of each column of the reshaped filter d m . This function is expressed as: Therefore, to update each filter d m with all components indexed by j 1 and j 2 in the mirror descent, we have the following expression: with p t the step size at t, and g t m the gradient of d m in equation (8). We solve equation (9) by ISTA for each filter F k : and then normalize so that each kF k k F ¼ 1.
Here, x t is the step size at t and S þ lx t ðÁÞ is the non-negative soft-threshold operator.

Obtaining the sparse representation by deep unfolding
Rather than relying on the algorithm from Section 2.3 to obtain the sparse representation, we adopt a different approach called deep unfolding (Gregor andLeCun 2010, Monga et al. 2021). The deep unfolding approach employs a neural network that parameterizes the iterates of an iterative algorithm as its forward pass, which obtains an approximate representation of the problem and has been shown to achieve significantly faster convergence in practice (Gregor and LeCun 2010).
We use deep unfolding to obtain an approximate sparse representation in problem 2. To construct our network, we use the iterates from Section 2.3, with the sparse code fx mn ; y mn ; z kn g and the scaled dual variables fC Án ; ! Án ; H Án g in the network initialized to be zero. Next, the equations (6), (7), (11), and (12) can be implemented as non-linearities of the layers in the network. We construct the first 2 Â K 1 layers by interleaving the iterates of equations (6) and (7), and the subsequent 2 Â K 2 layers by executing the iterates of equations (11) and (12). The parameters of the our network are the filters fd m ; F k g, the sparsity inducing parameters fk; lg, the penalty parameters fq; sg, and the stepsizes fg t1 ; c t1 ; p t2 ; x t2 : t 1 ¼ 1; . . . ; K 1 ; t 2 ¼ 1; . . . ; K 2 g, learned by training the network with backpropagation.
Once the network completes its forward pass, we define the loss of the network as where the sparse code fx K 1 mn ; y K 1 mn ; z K1 kn g are from the final output of the first 2 Â K 1 layers, and the filters fd K2 m ; F K2 k g are from the final output of the subsequent 2 Â K 2 layers of the network. An illustration of the unfolded network is shown in Fig. 3.

Enumerating at the image level
After the network is trained, we can retrieve the sparse code of the code images, i.e. the set fZ Án g by a network forward pass. Similar to the seeding phase in methods such as STREME and HOMER, our motif discovery also performs enumerations. Yet, instead of enumerating the oligonucleotides, we enumerate the combinations of the sparse code components in each Z Án , for n ¼ 1; . . . ; N. Specifically, we count how q 2 Z þ components are arranged in each Z Án . Given that Z Án has at most a components, the number of operations required to enumerate all possible combinations of q components in each Z Án is at most a q . This is a relatively small number, especially considering that we restrict both a and q to be small integers. We refer to each combination of q components and its spatial information a configuration. The configuration is defined as a tuple: where the numbers f ð1Þ ; f ð2Þ ; . . . ; f ðqÞ are the q filter-indices of filters fF k g directly inferred from Z Án in the combination, sorted by their spatial occurrences in the DNA string s n . The numbers d ð1Þ ; d ð2Þ ; . . . ; d ðqÀ1Þ are the distances (number of nucleotides) in between their neighboring components in the configuration. We store the configurations as keys in a hash table H. The value associated with each key in H is a collection of DNA strings of the same length, each obtained by taking the DNA string in the region covered by the particular configuration, as we perform the enumerations through DNA strings s 1 ; . . . ; s n . Thus, each set of the DNA strings associated with a key defines a multiple sequence alignment (MSA), and corresponds to a position weight matrix (PWM).
We select up to J most enriched PWMs in the hash table H as the motifs. As multiple keys in H may correspond to similar motifs, we reduce such redundancy by merging the similar PWMs using the average log-likelihood ratio (Wang and Stormo 2003). We report the resulting PWMs P 1 ; . . . ; P J 0 as the discovered motif in the dataset.

Soft versus hard clustering representation of the motifs
Since a PWM represents an average and proteins can have distinct binding preferences, the motif discovery problem is, in a sense, similar to the clustering problem. Soft-clustering scenarios, such as mixtures of Gaussians, allow a point to belong to multiple clusters, exhibiting characteristics that coexist in each of them. This situation frequently arises in motif discovery, where, for instance, the primary motif may frequently occur by itself but occasionally appears twice in certain DNA strings in the dataset.
In this work, we present our result in the soft-clustering representation. Unlike most traditional motif discovery methods that use hard clustering representations, which use PWMs composed of mutually exclusive DNA substrings, our method employs soft clustering representations that allow PWMs derived by multiple sequence alignment to share DNA substrings with other PWMs. We provide a detailed comparison of both approaches and their trade-offs in Supplementary File S1 Section C.

Hyperparameters
We implement our model with M ¼ jfd m gj ¼ 50 filters for the sparse representation of DNA strings fs N g, where each filter (a PFM) d m covers ' ¼ 8 nucleotides. Additionally, we use K ¼ jfF k gj ¼ 24 filters for code images, where each filter F k can cover 12 Â ð2 Â MÞ pixels. To limit the number of nonzero elements in each code image Z Án , we set a to 32. We set the operator T as a scaling operator, i.e. T ðMÞ ¼ bM, where b ¼ 100 to ensure numerical stability. Our unfolded network is trained using AdaBelief (Zhuang et al. 2020), with a batch size of 6 DNA strings. We set K 1 to 6 interleaved iterations on the sparse code and K 2 to 3 iterations on the filters. This parameterization results in a network, i.e. Fig. 3, consisting of 30, 421 parameters, which is much smaller in size compared with current mainstream deep learning models. We set q ¼ 3 for the configurations, which results in at most a q ¼ 4; 960 counting operations for each retrieved code image Z Án . We select up to J ¼ 1000 most enriched motifs from the stored enumerations in the hash table H to display in the results.

Motif significance
We randomly select DNA strings and set aside 15% of them as a test set that is not used during training. For each motif discovered and indexed by j, we follow this procedure: Let N T denote the total number of available positions in the test set, where N t is the number of DNA strings in the test set, and L is the length of each string. We define N T ¼ N t L. The number of hits of the j-th motif in the test set is denoted by s h , and the number of misses is denoted by s m ¼ N T À s h . A hit at a position is defined as the positions occupied by the PWM that scored above its threshold. The score threshold for each PWM is determined using the approximation algorithm pvalue2score with a P-value of 1eÀ3 (Touzet and Varré 2007). Hits and misses for the control set are similarly defined as c h and c m , respectively. The control set is constructed by shuffling the DNA strings in the test set such that the 2-mer frequency is preserved. We then perform Fisher's exact test to test the null hypothesis that the odds ratio ðs h =s m Þ=ðc h =c m Þ is one, against the alternative hypothesis that they are not equal.

Motif occurrences
We define the number of instances of a motif as the number of position that a PWM scores above its score threshold, i.e. the number of hits (Section 2.7.2) in the dataset.

Results
We take experimental datasets from JASPAR, FactorBook, ReMap, and Avsec et al. (Weirauch et al. 2014, Avsec et al. 2021b, Castro-Mondragon et al. 2022, Hammal et al. 2022, Pratt et al. 2022 to conduct motif analysis, for which we detailed our data processing steps in Supplementary File S1 Section A. In our analysis, we characterize all the motifs as position weight matrices (PWMs). All motifs presented in this section (motifs shown in Figs 4-6) are statistically significant with a P-value <1e À 6, as determined by the Fisher exact test (Section 2.7.2). Additionally, we make the following characterizations: • Primary motif: the short (often 6-12bp), ungapped, PWM that correspond to the core binding sites of a transcription factor (TF). Finding motifs using DNA images derived from sparse representations • Long motif: the long (often longer than 30bp) PWM with uniformly high information content columns, e.g. Fig. 4; these motifs, like gapped motifs, can also characterize multiple binding sites of the same or different TFs. • Gapped motif: the PWM that contain groups of consecutive high information content columns separated by low information content columns, e.g. PWMs in Fig. 6.
Our analysis of the datasets reveals a large presence of long motifs and gapped motifs in these experimental datasets. In particular, among the 91 ChIP-Seq datasets we selected from JASPAR, 50 of them contain long motifs that are transposable elements, as we verified in the Dfam database (Hubley et al. 2016) (Supplementary File S1 Section E). Moreover, gapped motifs were identified in many datasets from diverse sources, including 89 out of 198 datasets in JASPAR across various experiment types including ChIP-seq, DAP-seq, SELEX, PBM, and ChIP-Chip. This large presence of long motifs and gapped motifs in our analysis is noteworthy, as they are often overlooked in motif discovery and may have important implications in transcriptional regulation. In the following sections, we will highlight the discovery of gapped and long motifs. We skip the results on primary motif discovery as our method almost always find them (Supplementary File S1 Section F), and it is straight-forward problem tackled by many wellestablished methods.

Long motifs
A central theme that frequently occurs in long motif discovery is the overlap between the long motifs and the primary motifs. These overlappings present a significant challenge for traditional motif discovery methods, discussed in Section 4.1. In qualitative terms, the primary motifs can either be "embedded" within the long motifs or "compounded" in proximity to other motifs in the dataset.

"Embedded" motifs
Our method simultaneously discovers the primary motifs and the long motifs that embed these primary motifs, as shown in Fig. 4. We observe many such cases, especially in ChIP-Seq datasets, where TF binding sites in vivo can often overlap with repetitive elements. Our result reveals a relationship between primary motifs and repetitive elements, and suggests that the use of repeat masking is not strictly necessary for motif discovery.

"Compound" motifs
Our analysis shows that compound motifs are frequently seen in ChIP-Seq datasets. These findings suggest that our method effectively identifies binding sites including those that work in conjunction with the TF of interest, as TFs often work in concert with other TFs to regulate gene expression in vivo. We show in Fig. 5 several compound motifs that we identified in a ChIP-Nexus experiment exploring the localization of Oct4 pluripotency factor, using experimental data from (Avsec et al. 2021b).

Gapped motifs
Our method detects several gapped motifs that have been previously reported in the literature, shown in Fig. 6. For instance Zuo et al. (2023) report that traditional motif discovery methods have been shown to underestimate the binding sites of CTCF, a zinc finger protein containing 11 zinc finger domains, resulting in a paradox known as long fingers but short motifs. This paradox highlights that the full binding sites of evolutionarily conserved zinc finger domains may follow an irregular structure that is not easily detected by traditional methods. Our method successfully identifies the upstream motif TGCAG of the core binding sites of CTCF, as reported in (Zuo et al. 2023), shown in Fig. 6A. In addition, we confirm that ESR2, a Nuclear Receptor factors binds with its half sites AGGTCA (Khorasanizadeh and Rastinejad 2001, Siggers and Gordâ n 2014), which we found to be separated by a spacer up to 36 nucleotides, shown in Fig. 6B.
Notably, our method exhibits high sensitivity in quantifying the spacers within gapped motifs, providing detailed insights into gapped motifs' composition. For example, we identified a gapped motif in the MAFF factor from JASPAR that exhibits a gap between the primary motif and its partial complement, with a total of 26 spacers, shown in Supplementary File S1 Section D.

Distributed representation of DNA strings using sparse representation
A key distinction between traditional and more recent approaches to motif discovery is in how motifs are represented during optimization. Traditional methods, such as STREME and HOMER, typically use local representations, such as PWMs, for motif characterization during optimization. In contrast, recent approaches often use deep learning that leverage distributive representations to learn and represent motifs (Hinton et al. 1986). The choice between the two types of representation involves a trade-off between interpretability and expressiveness, with local representations being easier to interpret. However, problems that characterize motifs with local representations has generally been shown to be NP-hard (Bafna et al. 1997, Li et al. 1999, Akutsu et al. 2000. Due to this, common heuristics, such as substring masking, are often used during optimization to find motifs, resulting a sequential procedure for motif discovery (Bailey and Elkan 1995, Heinz et al. 2010, Bailey 2021. This sequential procedure may pose challenges as primary motifs can overlap with secondary motifs, including gapped motifs and transposable elements present in the dataset. Our result demonstrates the existence of these secondary motifs, and there are simulated experiments indicate that methods such as STREME and HOMER are less effective when multiple motifs overlaps (Chu and Stormo 2022).
We select sparse representations for DNA strings as they provide a distributive representation with a clear interpretation. By approximating DNA strings as a sum of linear convolutions (equation (1)), the non-zero components in the sparse code essentially act as indicators, indicating where filters should be used to represent DNA substrings ( Fig. 2A). The combined sparse code, which we refer to as code images, provides a high-level view on the spatial arrangements of the filters, which we build another sparse representation upon (Fig. 2B). This sparse representation on the code images  (Avsec et al. 2021a), such as Oct4-Sox2, Oct4-Oct4, and Zic3. (D) Our analysis revealed new insights into the Oct4-Oct4 motif, including the potential for variable spacing between the half-sites TGCA, ranging from 1 to 8 nucleotides. (E, G, C) We also observed that Oct4 can co-occur up to 4 times with equal-length spacers of TATG, and it is frequently associated with Zic3 factors. (F, H, I, J, K, L) Furthermore, we identified two minor variants of Oct4, which we designate as Oct4 0 and Oct4 00 , due to the insertion of a nucleotide A either upstream or downstream of the central nucleotide G in Oct4. Our additional findings for Sox2, Klf4, and Nanog pluripotency factors from (Avsec et al. 2021a) are presented in Supplementary File S1 Section F.
Finding motifs using DNA images derived from sparse representations enables us to identify the conserved patterns in the dataset, akin to enumerating k-mers at the nucleotide level. Yet, by enumerating at the image level, the spatial relationship in between the regulatory elements is preserved, in contrast to kmers enumerations, which do not account for the spatial information.

Comparison to recent deep learning methodologies in regulatory genomics
As with recent work on deep learning for regulatory genomics (Alipanahi et al. 2015, Avsec et al. 2021b, Yuan and Kelley 2022, our approach use distributive representations to characterize patterns in DNA strings. Additionally, our method incorporates the design of a neural network (Fig. 3). However, our network design process is distinct from standard deep learning practices. Rather than using traditional design tools, we create the network architecture by unfolding the iterative algorithm detailed in Section 2.3. The unfolding technique incorporates hyperparameters, such as sparsityinducing parameters, step-sizes, and penalty parameters from problem 2, as part of the network architecture, which are automatically tuned with backpropagation (Gregor and LeCun 2010, Monga et al. 2021). This results in a more expressive model compared with the original. We note that our network is fully interpretable as the forward pass can be seen as optimizing the objective posed by problem 2 via gradient descent. Inferring the motifs relies on enumerating at the image level (Section 2.5 and Fig. 2), without relying on explanation methods such as SHAP (Lundberg and Lee 2017). Our network is light in parameters (Section 2.7.1), allowing us to quickly train and infer motifs in a matter of minutes using a single GPU, and only requires DNA strings (FASTA files) as inputs.

Extensions to other regulatory genomics problems
Our method produces a computational graph (Fig. 3), which permits us to easily extend it to other regulatory genomics problems, such as DNA classifications or regressions. To accomplish this, one can treat the sparse code as an embedding, analogous to the embeddings constructed in the CNNs (Fig. 1). Furthermore, with the filters in problem 2 held fixed, the sparse code provides an alternative way of representing the binding sites. This alternative approach could be valuable for estimating the recognition code, e.g. designing C2H2 zinc- Figure 6. We present a selection of gapped motifs, chosen from the many we discovered in each dataset (labeled with TF/GEO accession number or TF/ JASPAR-ID). (A) Notably, we discovered the presence of the upstream motif CTGCAG of CTCF (Zuo et al. 2023); on the right, we note that the upstream motif of the core motif occur independently of the primary motif of CTCF. (C) Our analysis reveals that the primary motif of SP1 and ESR2 exhibits repetition with spacers of varying lengths. We note that ESR2 variable spacing correspond to the findings reported in (Siggers and Gordâ n, 2014). (D, E, F) In the case of STAT1, we observe a fixed spacing across all three datasets. (G) We present a gapped motif discovered in MAFF, which demonstrates our method's capability to capture gapped motifs with spacers longer than 40 bp. To avoid clutter, we display only motifs with the smallest, middle, and largest spacers, as determined by their pairwise distance (number of nucleotides) for ESR2, SP1, and MAFF. A full result of our motif discovery is in Supplementary File S1 Section F.

Future work
Our method effectively captures the gapped motifs (Figs 5 and 6), but highly varying spacers are common in these motifs. For instance, a gapped motif in BZIP MA0495.1 from JASPAR appears with 26 different spacer lengths (Supplementary File S1 Section D), resulting in numerous PWMs in our results. We intend to improve motif summarization and visualization in the future.

Conclusion
We present a motif discovery method that includes the discovery of long, gapped, or overlapping motifs. Our key concept, enumerating at the image level, enables a more extensive and diverse pattern identification in DNA strings compared with enumerating the k-mers. Our study demonstrates that the sparse representation is a highly interpretable modeling technique for DNA strings. This approach enables us to reveal that transposable elements and gapped motifs are common in ChIP-Seq datasets from JASPAR, Factorbook, and Remap (Castro-Mondragon et al. 2022, Hammal et al. 2022, Pratt et al. 2022). Our methodology is compatible with the deep learning paradigm through deep unfolding, enabling us to extend it to various regulatory genomics problems.